Effects of Information and Machine Learning Algorithms on Word Sense Disambiguation with Small Datasets

Abstract

Current approaches to word sense disambiguation use (and often combine) various machine learning techniques. Most rely on characteristics of the ambiguous word and its surrounding words and are trained on thousands of examples. Unfortunately, developing large training sets is burdensome, and in response to this challenge, we investigate the use of symbolic knowledge for small datasets. A naïve Bayes classifier was trained for each of 15 words, with 100 examples per word. Unified Medical Language System (UMLS) semantic types assigned to concepts found in the sentence, together with the relationships between these semantic types, form the knowledge base. The most frequent sense of a word served as the baseline. The effect of increasingly accurate symbolic knowledge was evaluated in nine experimental conditions. Performance was measured by accuracy based on 10-fold cross-validation. The best condition used only the semantic types of the words in the sentence. Accuracy in that condition was on average 10% higher than the baseline; however, it varied from 8% deterioration to 29% improvement. To investigate this large variance, we performed several follow-up evaluations, testing additional algorithms (decision tree and neural network) and gold standards (per expert), but the results did not differ significantly. However, we noted a trend: the best disambiguation was found for the words that were least troublesome to the human evaluators. We conclude that neither the algorithm nor individual human behavior causes these large differences, but that the structure of the UMLS Metathesaurus (used to represent the senses of ambiguous words) contributes to inaccuracies in the gold standard, leading to varied performance of word sense disambiguation techniques.
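
The evaluation setup described in the abstract (a naïve Bayes classifier over semantic-type features, compared against a most-frequent-sense baseline under 10-fold cross-validation) can be illustrated with a minimal sketch. This is not the authors' implementation. The ambiguous word, its two senses, the assignment of semantic types to sentences, the add-one smoothing, and every function name below are assumptions made purely for illustration, and the toy data is synthetic.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """Count senses and per-sense features from (features, sense) pairs."""
    sense_counts = Counter(sense for _, sense in examples)
    feat_counts = defaultdict(Counter)
    vocab = set()
    for feats, sense in examples:
        feat_counts[sense].update(feats)
        vocab.update(feats)
    return sense_counts, feat_counts, vocab

def predict_nb(model, feats):
    """Return the sense with the highest add-one-smoothed log posterior."""
    sense_counts, feat_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense in sorted(sense_counts):
        score = math.log(sense_counts[sense] / total)
        # The extra +1 reserves probability mass for unseen features.
        denom = sum(feat_counts[sense].values()) + len(vocab) + 1
        for f in feats:
            score += math.log((feat_counts[sense][f] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

def majority_sense(examples):
    """Baseline: the most frequent sense in the training examples."""
    return Counter(sense for _, sense in examples).most_common(1)[0][0]

def cross_validate(examples, k=10):
    """Return (naive Bayes accuracy, baseline accuracy) over k folds."""
    folds = [examples[i::k] for i in range(k)]
    nb_hits = base_hits = 0
    for i, test in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_nb(train)
        baseline = majority_sense(train)
        for feats, sense in test:
            nb_hits += predict_nb(model, feats) == sense
            base_hits += baseline == sense
    return nb_hits / len(examples), base_hits / len(examples)

# Synthetic toy data (assumption): 100 examples of one hypothetical
# ambiguous word, each represented only by semantic types found in
# its sentence. The sense labels are invented for this sketch.
toy = (
    [(["Disease or Syndrome", "Sign or Symptom"], "illness")] * 60
    + [(["Natural Phenomenon or Process", "Quantitative Concept"], "temperature")] * 40
)

nb_acc, base_acc = cross_validate(toy, k=10)
print(f"naive Bayes: {nb_acc:.2f}, most-frequent-sense baseline: {base_acc:.2f}")
```

Because the toy features separate the two senses perfectly, the classifier beats the baseline by a wide margin here. Real sentences share semantic types across senses, which is one way a classifier can fall below the baseline for some words, as the abstract reports.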